Fast Label Embeddings for Extremely Large Output Spaces

Authors

  • Paul Mineiro
  • Nikos Karampatziakis
Abstract

Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank-constrained estimation and low-dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state-of-the-art results.

1 Contributions

We provide a statistical motivation for label embedding by demonstrating that the optimal rank-constrained least squares estimator can be constructed from an optimal unconstrained estimator of an embedding of the labels. Thus, embedding can provide beneficial sample complexity reduction even if computational constraints are not binding.

We identify a natural object to define label similarity: the expected outer product of the conditional label probabilities. In particular, in conjunction with a low-rank constraint, this indicates two label embeddings are similar when their conditional probabilities are linearly dependent across the dataset. This unifies prior work utilizing the confusion matrix for multiclass [1] and the empirical label covariance for multilabel [5].

We apply techniques from randomized linear algebra [3] to develop an efficient and scalable algorithm for constructing the embeddings, essentially via a novel randomized algorithm. Intuitively, this technique implicitly decomposes the prediction matrix of a model which would be prohibitively expensive to form explicitly.

2 Proposed Algorithm

Our proposal is Rembrandt, described in Algorithm 1.
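As a concrete check of the rank-constrained correspondence above, the optimal rank-k least squares predictor can be recovered from the unconstrained solution by projecting onto the top-k right singular space of its predictions. The sketch below is hypothetical (it is not the authors' code) and uses plain NumPy to verify that this projection reproduces the best rank-k approximation of the fitted values:

```python
# Hypothetical numerical check (not from the paper): with W the
# unconstrained least squares solution and V_k the top-k right singular
# vectors of X W, the rank-k solution W_k = W V_k V_k^T yields fitted
# values equal to the rank-k truncated SVD of X W.
import numpy as np

rng = np.random.default_rng(0)
n, d, c, k = 100, 20, 30, 5
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, c))

W, *_ = np.linalg.lstsq(X, Y, rcond=None)    # unconstrained least squares
U, s, Vt = np.linalg.svd(X @ W, full_matrices=False)
Vk = Vt[:k].T                                # top-k right singular vectors of X W
Wk = W @ Vk @ Vk.T                           # rank-k constrained solution

# X @ Wk coincides with the rank-k truncated SVD of X @ W
truncated = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print(np.allclose(X @ Wk, truncated))        # True
```

So the low-dimensional embedding Vk of the labels is all that is needed beyond the unconstrained fit, which is the sample-complexity observation made above.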
We use the top right singular space of Π_{X,L}Y as a label embedding, or equivalently, the top principal components of Y⊤Π_{X,L}Y. Using randomized techniques, we can compute this decomposition without explicitly forming the matrix.

Algorithm 1 Rembrandt: Response EMBedding via RANDomized Techniques
 1: function Rembrandt(k, X ∈ R^{n×d}, Y ∈ R^{n×c})
 2:   (p, q) ← (20, 1)              ▷ these hyperparameters rarely need adjustment
 3:   Q ← randn(c, k + p)
 4:   for i ∈ {1, . . . , q} do     ▷ randomized range finder for Y⊤Π_{X,L}Y
 5:     Z ← arg min_Z ‖Y Q − X Z‖_F
 6:     Q ← orthogonalize(Y⊤X Z)
 7:   end for
 8:   Z ← arg min_Z ‖Y Q − X Z‖_F  ▷ NB: total of (q + 1) data passes, including this line
 9:   F ← (Y⊤X Z)⊤(Y⊤X Z)          ▷ F ∈ R^{(k+p)×(k+p)} is “small”
10:   (V, Σ) ← eig(F, k)
11:   V ← Q V                      ▷ V ∈ R^{c×k} is the embedding
12:   return (V, Σ)
13: end function
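The listing above can be sketched compactly in NumPy. This is a hypothetical illustration, not the authors' implementation: `orthogonalize` is taken to be a thin QR factorization, `eig(F, k)` is realized as a symmetric eigendecomposition followed by selecting the top-k eigenpairs, and the final least-squares solve is made explicit as the extra data pass:

```python
# Hypothetical NumPy sketch of Algorithm 1 (Rembrandt); not the authors'
# code. orthogonalize = thin QR; eig(F, k) = top-k symmetric eigenpairs.
import numpy as np

def rembrandt(k, X, Y, p=20, q=1, seed=None):
    """Randomized label embedding from the top eigenspace of Y^T Pi_X Y,
    where Pi_X projects onto the column space of X."""
    c = Y.shape[1]
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((c, k + p))
    for _ in range(q):                                 # randomized range finder
        Z, *_ = np.linalg.lstsq(X, Y @ Q, rcond=None)  # Z = argmin ||Y Q - X Z||_F
        Q, _ = np.linalg.qr(Y.T @ (X @ Z))             # orthogonalize(Y^T X Z)
    Z, *_ = np.linalg.lstsq(X, Y @ Q, rcond=None)      # the (q + 1)-th data pass
    B = Y.T @ (X @ Z)                                  # approx (Y^T Pi_X Y) Q, c x (k+p)
    F = B.T @ B                                        # "small" (k+p) x (k+p) matrix
    evals, V = np.linalg.eigh(F)                       # ascending eigenpairs of F
    order = np.argsort(evals)[::-1][:k]                # keep the top k
    return Q @ V[:, order], evals[order]               # embedding in R^{c x k}
```

With q = 1 this touches the data only twice: once inside the range finder and once to build the small matrix F, after which all work is in the (k + p)-dimensional space.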

Related articles

Fast Label Embeddings via Randomized Linear Algebra

Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank constrained estimation and low dimensional label embeddings that uncovers a fast label embedding algorit...

Locally Non-linear Embeddings for Extreme Multi-label Learning

The objective in extreme multi-label learning is to train a classifier that can automatically tag a novel data point with the most relevant subset of labels from an extremely large label set. Embedding based approaches make training and prediction tractable by assuming that the training label matrix is low-rank and hence the effective number of labels can be reduced by projecting the high dimen...

HAMLET: Interpretable Human And Machine co-LEarning Technique

Efficient label acquisition processes are key to obtaining robust classifiers. However, data labeling is often challenging and subject to high levels of label noise. This can arise even when classification targets are well defined, if instances to be labeled are more difficult than the prototypes used to define the class, leading to disagreements among the expert community. Here, we enable effi...

Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces

We combine multi-task learning and semisupervised learning by inducing a joint embedding space between disparate label spaces and learning transfer functions between label embeddings, enabling us to jointly leverage unlabelled data and auxiliary, annotated datasets. We evaluate our approach on a variety of sequence classification tasks with disparate label spaces. We outperform strong single an...

Sparse Local Embeddings for Extreme Multi-label Classification

The objective in extreme multi-label learning is to train a classifier that can automatically tag a novel data point with the most relevant subset of labels from an extremely large label set. Embedding based approaches attempt to make training and prediction tractable by assuming that the training label matrix is low-rank and reducing the effective number of labels by projecting the high dimens...


Journal:
  • CoRR

Volume: abs/1503.08873  Issue: -

Pages: -

Publication date: 2014